Word Segmentation for Document Images by Successively Merging Adjacent Character Bounding Boxes by Iterative Dilation
Author
Abstract
A new method of word segmentation for document images is presented. The method encloses each letter (character) of a word in a bounding box and then progressively fills the resulting inter-letter spaces, so that the character bounding boxes merge into word bounding boxes. The method works for inclined and irregularly distributed words. It completely avoids the line segmentation step that normally precedes word segmentation in traditional methods.
Keywords: Bounding boxes, Connected components, Horizontal Dilation, Character spacing, Word bounding boxes, Word segmentation, Word spacing.
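The merging idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it works on bounding-box coordinates rather than on image pixels, and the names `segment_words`, `iterations`, and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration. After `k` rounds of one-pixel horizontal dilation on each side, any horizontal gap of at most `2 * k` pixels between vertically overlapping boxes is treated as filled, and the boxes on either side are merged.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max), inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def _close(a: Box, b: Box, pad: int) -> bool:
    """True if a and b overlap vertically and touch after each is widened by pad pixels."""
    horizontal = a[0] - pad <= b[2] + pad + 1 and b[0] - pad <= a[2] + pad + 1
    vertical = a[1] <= b[3] and b[1] <= a[3]
    return horizontal and vertical


def _union(a: Box, b: Box) -> Box:
    """Smallest box that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def _merge(boxes: List[Box], pad: int) -> List[Box]:
    """Merge boxes that touch at the given dilation, repeating until nothing changes."""
    while True:
        merged: List[Box] = []
        for box in sorted(boxes):
            for i, other in enumerate(merged):
                if _close(box, other, pad):
                    merged[i] = _union(box, other)
                    break
            else:
                merged.append(box)
        if len(merged) == len(boxes):
            return sorted(merged)
        boxes = merged


def segment_words(char_boxes: List[Box], iterations: int) -> List[Box]:
    """Grow character boxes horizontally one pixel per round and merge those that meet."""
    boxes = list(char_boxes)
    for pad in range(1, iterations + 1):
        boxes = _merge(boxes, pad)
    return boxes


if __name__ == "__main__":
    chars = [(0, 0, 4, 9), (7, 0, 11, 9), (14, 1, 18, 10),   # first word
             (30, 0, 34, 9), (37, 0, 41, 9),                 # second word
             (0, 30, 4, 39)]                                 # a character on another line
    print(segment_words(chars, iterations=2))
```

In this sketch the number of iterations has to be chosen so that `2 * iterations` lies between the typical letter spacing and the typical word spacing; with too many iterations, neighbouring words on the same line merge as well. Because the dilation is horizontal only, boxes on different lines never merge, which is why no separate line segmentation step is needed.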
Similar resources
Extraction of text lines and text blocks on document images based on statistical modeling
In this article, we developed a Bayesian model to characterize text line and text block structures on document images using the text word bounding boxes. We posed the extraction problem as finding the text lines and text blocks that maximize the Bayesian probability of the text lines and text blocks given the text word bounding boxes. In particular, we derived the so-called probabilistic linear...
A Font and Size Independent Content Based Retrieval System for Kannada Document Images
This paper presents a Content based image retrieval system for Kannada Document images. Given a query word, the system returns the documents in the database in which there is a similar word, with the word highlighted. The retrieval works for Kannada document images which have different font sizes and styles. First the scanned Kannada document images are preprocessed to reduce image noise. Then ...
Chip Refinement Character Recognition Text Clean - up I 2 Segmentation Texture Segmentation Texture Segmentation Texture Segmentation Texture Generation
There are many applications in which the automatic detection and recognition of text embedded in images is useful. These applications include multimedia systems, digital libraries, and Geographical Information Systems. When machine-generated text is printed against clean backgrounds, it can be converted to a computer-readable form (ASCII) using current Optical Character Recognition (OCR) technol...
Word Spotting in Chinese Document Images without Layout Analysis
An approach to searching user-specified words/phrases in Chinese document images, without the requirements of layout analysis, is proposed in this paper. Bounding boxes of Chinese character images are first determined using connected component analysis. Next, a suitable character from the user-specified word/phrase is chosen as the initial character to search for a matching candidate in the doc...
Persian Printed Document Analysis and Page Segmentation
This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In the low-resolution page segmentation, a pyramidal image structure is constructed for multiscale analysis, and the document image is segmented into a set of regions. In the high-resolution page segmentation, each region is segmented into homogeneous regions by connected components analysis and identifyi...